1 Description of Data

The data were collected from the Chinese People’s Daily newspaper for the years 2019 and 2020. The daily issues are published online on the newspaper’s website with a consistent structure (described below). Despite the slow loading speed of the website, we exploited this structure to scrape the first two pages of each daily issue.

For instance, the 1st page of the issue published on 2020-03-22 can be accessed here.

Figure 1.1: Example newspaper page published on 2020-03-22 (left), and Section or column 1 of it enlarged (right).

We dissect the characteristics of this particular page as follows; the same properties apply to all other pages.

First of all, the main link to the page is http://paper.people.com.cn/rmrb/html/2020-03/22/nbs.D110000renmrb_01.htm. This takes us to one page of the newspaper, in this particular example the 1st page. However, a lot of content is packed onto this single page: it contains 10 different sections or columns (see Figure 1.1 for this example page published on 2020-03-22).

As Figure 1.1 shows, several different sections are squeezed into the single page of the newspaper. Each section is clickable and has a unique id. A click on a section redirects to a link where the full content of the section can be read in an enlarged view. For example, the first of the 10 sections on this example page is section 1 (see Figure 1.1, right). The section id for this first section is nw.D110000renmrb_20200322_1-01.
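The link and id patterns in this example can be sketched in code. The following is a minimal illustration, not the scraper actually used: the function names are hypothetical, and the formatting rules (a two-digit, zero-padded page number and an unpadded section number) are assumptions generalised from the single 2020-03-22 example.

```python
import re
from datetime import date

# Base of the example page link quoted in the text.
BASE_URL = "http://paper.people.com.cn/rmrb/html"
PREFIX = "D110000renmrb"

# Pattern for section ids such as nw.D110000renmrb_20200322_1-01.
SECTION_ID_RE = re.compile(
    r"^nw\.D110000renmrb_(\d{4})(\d{2})(\d{2})_(\d+)-(\d+)$"
)


def page_url(day, page):
    """Build the link to one page of the issue published on `day`."""
    # Assumption: the page number is zero-padded to two digits.
    return f"{BASE_URL}/{day:%Y-%m}/{day:%d}/nbs.{PREFIX}_{page:02d}.htm"


def section_id(day, section, page):
    """Build a section id from the date, section number and page number."""
    # Assumption: the section number is not zero-padded.
    return f"nw.{PREFIX}_{day:%Y%m%d}_{section}-{page:02d}"


def parse_section_id(sid):
    """Split a section id into (date, section number, page number)."""
    match = SECTION_ID_RE.match(sid)
    if match is None:
        raise ValueError(f"not a recognised section id: {sid!r}")
    year, month, day, section, page = (int(g) for g in match.groups())
    return date(year, month, day), section, page


example_day = date(2020, 3, 22)
print(page_url(example_day, 1))
print(section_id(example_day, 1, 1))
print(parse_section_id("nw.D110000renmrb_20200322_1-01"))
```

For the example issue, `page_url` and `section_id` reproduce the link and the id quoted above, which is the only check the text allows; other dates may need adjustments.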

Accordingly, the id of each section on a page has the form nw.D110000renmrb_yyyymmdd_section#-page#. Each section of a newspaper page has the following additional characteristics.

The maximum number of sections on a page is 15, on a page published on 2019-04-26, and the minimum is 1. On average, there were 5.99 sections per page for the newspapers published since January 1st, 2019, regardless of the page number. Looking at the frequency distributions for each page separately, page 2 tends to have fewer issues with very high numbers of sections (above around 9).
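Summary figures of this kind (median, mean, sd, max and min, as reported in Table 2.1) can be computed as in the sketch below. The per-page records are hypothetical values made up for illustration, `summarize` is an illustrative helper name, and the use of the sample standard deviation is an assumption about how sd was calculated.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return the median, mean, sample sd, max and min of a list of numbers."""
    return {
        "median": median(values),
        "mean": mean(values),
        # Assumption: sd is the sample standard deviation (n - 1 denominator).
        "sd": stdev(values),
        "max": max(values),
        "min": min(values),
    }


# Hypothetical per-page records: (number of paragraphs, number of sections).
pages = [(77, 6), (120, 8), (20, 1), (64, 5)]

num_of_sections = [sections for _, sections in pages]
# Paragraphs per section is taken here as the ratio of the two counts on a page.
paragraph_per_section = [paragraphs / sections for paragraphs, sections in pages]

print(summarize(num_of_sections))
print(summarize(paragraph_per_section))
```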

Therefore, the contents of all the sections (paragraph or p-tags) on a page together make up the contents of the entire page, where they are compactly laid out. In particular, we scraped the sections of the newspapers through their unique ids. The date and the page number uniquely identify a newspaper page, and their combination, together with the section number, forms the ids of the sections, prefixed with nw.D110000renmrb (the page link itself is prefixed with nbs.D110000renmrb).
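As an illustration of collecting the paragraph (p-tag) contents of one section, the sketch below uses only Python's standard-library HTML parser. It is not the scraper used for this data set: the class and function names are hypothetical, the sample HTML is invented, and the real section pages may need extra filtering (for example, restricting to a particular container element).

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Store the buffered text as one paragraph, skipping empty ones.
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an unclosed one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of non-empty paragraph texts found in `html`."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


# Invented sample markup standing in for a downloaded section page.
sample = "<div><h1>Title</h1><p>First <b>paragraph</b>.</p><p> Second </p><p></p></div>"
paragraphs = extract_paragraphs(sample)
print(len(paragraphs), paragraphs)
```

Counting the returned list per section gives the paragraph counts, and summing over the sections of a page gives the per-page totals.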

Moreover, most of the newspaper pages are bulky in terms of number of paragraphs and text volume. On average, there are 82.71 paragraphs per page, and 16.62

2 Descriptive Statistics

Table 2.1: Descriptive Statistics: How bulky is the daily newspaper?

| Year | Variable | Page 1 median | Page 1 mean | Page 1 sd | Page 1 max | Page 1 min | Page 2 median | Page 2 mean | Page 2 sd | Page 2 max | Page 2 min |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019 | num_of_paragraphs | 77.0 | 91.9 | 55.8 | 367.0 | 20.0 | 70.0 | 76.6 | 30.5 | 246 | 9.0 |
| 2019 | num_of_sections | 6.0 | 6.2 | 1.8 | 15.0 | 1.0 | 6.0 | 5.7 | 2.1 | 9 | 1.0 |
| 2019 | paragraph_per_section | 12.7 | 15.6 | 9.4 | 65.5 | 4.1 | 12.7 | 18.5 | 23.5 | 246 | 4.2 |
| 2019 | words | 2898.0 | 3326.3 | 1756.0 | 16190.0 | 803.0 | 2537.0 | 2661.3 | 888.1 | 8675 | 219.0 |
| 2020 | num_of_paragraphs | 69.0 | 83.1 | 48.8 | 341.0 | 30.0 | 66.0 | 70.2 | 26.8 | 204 | 9.0 |
| 2020 | num_of_sections | 7.0 | 6.4 | 1.7 | 11.0 | 2.0 | 6.0 | 6.0 | 2.0 | 9 | 1.0 |
| 2020 | paragraph_per_section | 11.3 | 13.9 | 8.9 | 56.8 | 4.2 | 10.0 | 15.9 | 22.7 | 204 | 4.0 |
| 2020 | words | 2871.0 | 3137.7 | 1212.0 | 9732.0 | 1508.0 | 2446.0 | 2439.3 | 651.6 | 4270 | 232.0 |

Figure 2.1: Distribution of number of sections in a page of a newspaper

Figure 2.2: Distribution of Number of Paragraphs in the Newspaper Page

Figure 2.3: Word counts in the page of the newspaper per day

Figure 2.4: How bulky is a page of a newspaper in terms of word counts?

Figure 2.5: Term Frequency Distribution per page of the newspaper

Table 2.2: The 12 bigrams with the highest tf_idf in 2020

| page_num | month | bigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|---|
| 1st | Feb | outbreak prevention | 1016 | 0.01127 | 0.40547 | 0.00457 |
| 1st | Apr | basic law | 19 | 0.00162 | 2.48491 | 0.00403 |
| 1st | Jan | xi jinping | 919 | 0.00922 | 0.40547 | 0.00374 |
| 2nd | Apr | 培 玉 | 17 | 0.00145 | 2.48491 | 0.00361 |
| 2nd | Apr | 赵 培 | 17 | 0.00145 | 2.48491 | 0.00361 |
| 1st | Feb | spotter sue | 111 | 0.00123 | 2.48491 | 0.00306 |
| 2nd | Feb | outbreak prevention | 468 | 0.00519 | 0.40547 | 0.00210 |
| 1st | Mar | labor education | 75 | 0.00072 | 2.48491 | 0.00179 |
| 1st | Mar | crown pneumonia | 260 | 0.00249 | 0.69315 | 0.00173 |
| 2nd | Jan | xi jinping | 413 | 0.00414 | 0.40547 | 0.00168 |
| 1st | Jan | theme education | 187 | 0.00188 | 0.87547 | 0.00164 |
| 2nd | Mar | crown pneumonia | 222 | 0.00213 | 0.69315 | 0.00148 |

Table 2.3: The 12 trigrams with the highest tf_idf in 2020

| page_num | month | trigram | n | tf | idf | tf_idf |
|---|---|---|---|---|---|---|
| 2nd | Apr | 赵 培 玉 | 17 | 0.00238 | 2.48491 | 0.00591 |
| 2nd | Apr | 何 香 云 | 12 | 0.00168 | 2.48491 | 0.00417 |
| 2nd | Apr | flood control drought | 10 | 0.00140 | 1.79176 | 0.00251 |
| 1st | Mar | crown pneumonia outbreak | 202 | 0.00320 | 0.69315 | 0.00222 |
| 1st | Jan | party central committee | 259 | 0.00445 | 0.40547 | 0.00180 |
| 2nd | Feb | health care professionals | 134 | 0.00255 | 0.53900 | 0.00137 |
| 1st | Jan | china features socialist | 189 | 0.00325 | 0.40547 | 0.00132 |
| 1st | Jan | era china features | 108 | 0.00185 | 0.69315 | 0.00129 |
| 1st | Mar | rural community workers | 43 | 0.00068 | 1.79176 | 0.00122 |
| 2nd | Feb | traditional chinese medicine | 87 | 0.00166 | 0.69315 | 0.00115 |
| 1st | Feb | education supervision mechanism | 24 | 0.00046 | 2.48491 | 0.00113 |
| 2nd | Mar | rural community workers | 33 | 0.00052 | 1.79176 | 0.00094 |
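The columns of Tables 2.2 and 2.3 can be read as follows: n is the count of the n-gram in a document, tf is that count divided by the total number of n-grams in the document, and tf_idf is the product of tf and idf. The idf values in the tables (for example 2.48491, 0.69315 and 0.40547) are numerically consistent with a natural-log idf, ln(N / df), over N = 12 documents; how the documents were defined is not stated here, so that reading is an inference from the numbers. The sketch below implements this calculation on toy data; the helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute (n, tf, idf, tf_idf) for every term in every document.

    `docs` maps a document id to its list of terms (for example bigrams).
    idf is the natural log of (number of documents / documents containing
    the term); this choice is an assumption inferred from the tables.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    result = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())
        for term, n in term_counts.items():
            tf = n / total
            idf = math.log(n_docs / doc_freq[term])
            result[(doc_id, term)] = (n, tf, idf, tf * idf)
    return result


# Toy documents keyed by a hypothetical (page, month) pair.
docs = {
    ("1st", "Jan"): ngrams("party central committee meets party central".split(), 2),
    ("2nd", "Jan"): ngrams("flood control work and party central".split(), 2),
}
for key, values in sorted(tf_idf(docs).items()):
    print(key, values)

# A term found in 1 of 12 documents has idf = ln(12), as in the tables.
print(round(math.log(12 / 1), 5))
```

A term that occurs in every document gets idf = 0 and therefore tf_idf = 0, which is why the top rows of both tables mix very frequent but widespread terms with rarer terms confined to a single document.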